Exploring Places

Experiments:

    1. Visualizing the Places dataset
    2. Exploring Place Tags
    3. Exploring Town & Place Names
    4. Exploring Properties
    5. Exploring Place Description Similarities
    6. Topic Modelling of Place Descriptions
In [1]:
import json
import pandas as pd
import plotly.express as px
import os
import plotly.graph_objects as go
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from bertopic import BERTopic
In [2]:
#data_path="places.json"
data_path="dataset/sample_20210501.json"
# Keep the file path and the parsed JSON in separate variables
with open(data_path, 'r') as f:
    data = json.load(f)
places=data["places"]
print(len(places))
df = pd.DataFrame(places)
531

Visualizing the places dataframe

In [3]:
df["properties"].iloc[0]
Out[3]:
{'place.child-restrictions': True,
 'place.facilities.free-wifi': True,
 'place.facilities.dogs-allowed': False,
 'place.facilities.parking': True,
 'place.facilities.toilets': True,
 'place.facilities.toilets_disabled': False,
 'place.facilities.wheelchair-access': False,
 'place.capacity.max': '160'}
In [4]:
df.shape[0]
Out[4]:
531

Experiment 1: Exploring Place Ids

In [5]:
df_ids=df.groupby(['place_id']).size().reset_index()
df_ids=df_ids.rename(columns={0: "number_of_times"}).sort_values(by=['number_of_times'], ascending=False)
df_ids
Out[5]:
place_id number_of_times
0 1 1
349 58123 1
363 61213 1
362 61204 1
361 61136 1
... ... ...
171 18896 1
170 18603 1
169 18572 1
168 16573 1
530 131091 1

531 rows × 2 columns
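
Every count in the table above is 1, which suggests that each place_id appears exactly once. The following is a minimal sketch of a more direct uniqueness check, run on a small hypothetical frame (the ids and the `toy` name are illustrative assumptions, not part of the dataset):

```python
import pandas as pd

# Hypothetical ids; in the notebook the real column is df["place_id"].
toy = pd.DataFrame({"place_id": [1, 376, 377, 376]})

# True only when no id is repeated.
all_unique = toy["place_id"].is_unique

# Ids that occur more than once (keep=False flags every repeated row).
repeated = toy.loc[toy["place_id"].duplicated(keep=False), "place_id"].unique().tolist()

print(all_unique, repeated)
```

On the notebook's dataframe, `df["place_id"].is_unique` should give the same answer as scanning the grouped table by eye.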

Experiment 2: Exploring Place Tags

Each place stores its tags as a list. We split those lists so that every tag gets its own row.

In [6]:
df["tags"][0:5]
Out[6]:
0        [Bar & pub food, Comedy, Restaurants, Venues]
1                         [Conference Centres, Venues]
2                                   [Theatres, Venues]
3    [Cathedrals, Church, Churches, Galleries, Publ...
4    [Church, Churches, Galleries, Place Of Worship...
Name: tags, dtype: object
In [7]:
df_tags=df.explode('tags')
In [8]:
df_tags
Out[8]:
address email postal_code properties sort_name town website place_id modified_ts created_ts name loc country_code tags descriptions phone_numbers status
0 5 York Place admin@thestand.co.uk EH1 3EB {'place.child-restrictions': True, 'place.faci... Stand Edinburgh http://www.thestand.co.uk 1 2021-11-24T12:18:33Z 2021-11-24T12:18:33Z The Stand {'latitude': '55.955806109395006', 'longitude'... GB Bar & pub food [{'type': 'description.list.default', 'descrip... {'info': '0131 558 7272', 'box_office': '0131 ... live
0 5 York Place admin@thestand.co.uk EH1 3EB {'place.child-restrictions': True, 'place.faci... Stand Edinburgh http://www.thestand.co.uk 1 2021-11-24T12:18:33Z 2021-11-24T12:18:33Z The Stand {'latitude': '55.955806109395006', 'longitude'... GB Comedy [{'type': 'description.list.default', 'descrip... {'info': '0131 558 7272', 'box_office': '0131 ... live
0 5 York Place admin@thestand.co.uk EH1 3EB {'place.child-restrictions': True, 'place.faci... Stand Edinburgh http://www.thestand.co.uk 1 2021-11-24T12:18:33Z 2021-11-24T12:18:33Z The Stand {'latitude': '55.955806109395006', 'longitude'... GB Restaurants [{'type': 'description.list.default', 'descrip... {'info': '0131 558 7272', 'box_office': '0131 ... live
0 5 York Place admin@thestand.co.uk EH1 3EB {'place.child-restrictions': True, 'place.faci... Stand Edinburgh http://www.thestand.co.uk 1 2021-11-24T12:18:33Z 2021-11-24T12:18:33Z The Stand {'latitude': '55.955806109395006', 'longitude'... GB Venues [{'type': 'description.list.default', 'descrip... {'info': '0131 558 7272', 'box_office': '0131 ... live
1 Ingliston NaN EH28 8NB {'place.capacity.max': '35000'} Royal Highland Centre Edinburgh http://www.royalhighlandcentre.co.uk 376 2020-01-27T10:18:15Z 2020-01-27T10:18:15Z Royal Highland Centre {'latitude': '55.94067800', 'longitude': '-3.3... GB Conference Centres [{'type': 'description.list.default', 'descrip... {'box_office': '0131 335 6200'} live
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
528 Starbank Road NaN EH5 3BX NaN Starbank Park Edinburgh NaN 130944 2021-09-28T10:37:46Z 2021-09-28T10:37:46Z Starbank Park {'latitude': '55.97939906204032', 'longitude':... GB Parks NaN NaN live
528 Starbank Road NaN EH5 3BX NaN Starbank Park Edinburgh NaN 130944 2021-09-28T10:37:46Z 2021-09-28T10:37:46Z Starbank Park {'latitude': '55.97939906204032', 'longitude':... GB Venues NaN NaN live
529 41 Thistle Street NaN EH2 1EN NaN Blunt Knife Co. Edinburgh NaN 130974 2021-10-01T11:09:08Z 2021-10-01T11:09:08Z Blunt Knife Co. {'latitude': '55.95388975515418', 'longitude':... GB Art Gallery [{'type': 'description.official', 'description... NaN live
530 52 Bridge Road NaN EH13 0LQ NaN Colinton Arts Edinburgh NaN 131091 2021-10-11T17:25:26Z 2021-10-11T17:25:26Z Colinton Arts {'latitude': '55.90682300', 'longitude': '-3.2... GB Galleries NaN NaN live
530 52 Bridge Road NaN EH13 0LQ NaN Colinton Arts Edinburgh NaN 131091 2021-10-11T17:25:26Z 2021-10-11T17:25:26Z Colinton Arts {'latitude': '55.90682300', 'longitude': '-3.2... GB Venues NaN NaN live

1402 rows × 17 columns
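
To make the effect of explode easier to see, here is a small sketch on a hypothetical three-row frame (the names and tags are made up for illustration). It shows that the original index is repeated for each tag and that an empty tag list becomes a missing value:

```python
import pandas as pd

# Hypothetical toy data shaped like the "tags" column.
toy = pd.DataFrame({
    "name": ["Place A", "Place B", "Place C"],
    "tags": [["Comedy", "Venues"], ["Theatres"], []],
})

# One output row per list element; the other columns are copied.
exploded = toy.explode("tags")
print(exploded)
```

Because the index is repeated, rows belonging to the same place can still be recognised after the split. Rows whose tag is missing are ignored by a later groupby on that column, since groupby drops missing keys by default.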

In [9]:
g_tags=df_tags.groupby(['tags']).size().reset_index()
g_tags=g_tags.rename(columns={0: "number_of_times"}).sort_values(by=['number_of_times'], ascending=False)
g_tags
Out[9]:
tags number_of_times
215 Venues 108
154 Outdoors 106
101 Galleries 82
105 Gardens 65
160 Public buildings 62
... ... ...
68 Conference centre 1
70 Country House 1
137 Monuments 1
71 Country Parks 1
0 Abbey 1

235 rows × 2 columns
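
The same frequency table can probably be produced more compactly with value_counts. This is a sketch on hypothetical toy data, not the notebook's own cell. The column name number_of_times is kept so the result could be passed to the same plotting call:

```python
import pandas as pd

# Hypothetical toy data; the notebook's real input is df["tags"].
toy = pd.DataFrame({"tags": [["Comedy", "Venues"], ["Theatres", "Venues"], ["Venues"]]})

tag_counts = (
    toy.explode("tags")["tags"]
    .value_counts()                       # counts per tag, most frequent first
    .rename_axis("tags")                  # name the index so it becomes a column
    .reset_index(name="number_of_times")  # back to a two-column DataFrame
)
print(tag_counts)
```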

In [10]:
px.histogram(g_tags, x="tags", y="number_of_times", histfunc="sum", color="tags", title='Frequency of tags places')

Experiment 3: Exploring Towns & Names

In [11]:
df["town"][1:10]
Out[11]:
1    Edinburgh
2    Edinburgh
3    Edinburgh
4    Edinburgh
5    Edinburgh
6    Edinburgh
7    Edinburgh
8    Edinburgh
9    Edinburgh
Name: town, dtype: object

3.1 Frequency of places grouped by towns

In [12]:
df_town=df.dropna(subset=['town'])
town=df_town.groupby(['town']).size().reset_index()
town=town.rename(columns={0: "number_of_times"})
town=town.drop([0])  # drop the group at index 0, i.e. the first town value in sorted order
In [13]:
town=town.sort_values(by=['number_of_times'], ascending=False)
town
Out[13]:
town number_of_times
29 Edinburgh 314
77 St Andrews 16
49 Kirkcaldy 13
66 North Berwick 13
24 Dunfermline 13
... ... ...
46 Kelty 1
47 Kincardine 1
48 Kinghorn 1
50 Kirkliston 1
87 Wormit 1

87 rows × 2 columns

In [14]:
px.scatter(town, x='town', y='number_of_times', color='number_of_times',  size="number_of_times", size_max=60, title="Frequency of places grouped by towns")

3.2 Frequency of places grouped by name

In [15]:
df_name_town=df.groupby(['name']).size().reset_index()
df_name_town=df_name_town.rename(columns={0: "number_of_times"})
df_name_town=df_name_town.sort_values(by=['number_of_times'], ascending=False)
df_name_town.reset_index()
Out[15]:
index name number_of_times
0 392 Styx 2
1 0 101 Greenbank Crescent 1
2 355 Southside Community Centre 1
3 349 Sketchy Beats Cafe 1
4 350 Skydive St Andrews 1
... ... ... ...
525 171 Grassmarket Community Project 1
526 170 Granton Gasworks 1
527 169 Gosford House 1
528 168 Gooey Events 1
529 529 theSpace on the Mile 1

530 rows × 3 columns

3.3 Frequency of places grouped by name and town

In [16]:
df_name_town=df.groupby(['name', 'town']).size().reset_index()
df_name_town=df_name_town.rename(columns={0: "number_of_times"})
df_name_town=df_name_town.sort_values(by=['number_of_times'], ascending=False)
df_name_town
Out[16]:
name town number_of_times
0 101 Greenbank Crescent Edinburgh 1
349 Sketchy Beats Cafe Edinburgh 1
363 St Andrews Heritage Museum & Garden St Andrews 1
362 St Andrews Castle St Andrews 1
361 St Andrews Art Club St Andrews 1
... ... ... ...
171 Grassmarket Community Project Edinburgh 1
170 Granton Gasworks Edinburgh 1
169 Gosford House Longniddry 1
168 Gooey Events Livingston village 1
530 theSpace on the Mile Edinburgh 1

531 rows × 3 columns
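
Grouping by name alone shows one name (Styx) twice, while grouping by name and town gives 531 groups of size 1. This suggests the two entries share a name but sit in different towns. The sketch below reproduces that pattern on hypothetical data (all names and towns are placeholders):

```python
import pandas as pd

# Hypothetical places: the same name used in two different towns.
toy = pd.DataFrame({
    "name": ["Cafe X", "Cafe X", "Museum Y"],
    "town": ["Town A", "Town B", "Town A"],
})

by_name = toy.groupby("name").size()
by_name_town = toy.groupby(["name", "town"]).size()

# Names used by more than one place.
shared_names = by_name[by_name > 1].index.tolist()

print(shared_names)
print(by_name_town)
```

Here the name-only grouping reports a repeated name, but no (name, town) pair occurs more than once.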

Experiment 4: Exploring Properties

In [17]:
df_properties=pd.concat([df.drop(['properties'], axis=1), df['properties'].apply(pd.Series)], axis=1)
In [18]:
df_properties[0:3]
Out[18]:
address email postal_code sort_name town website place_id modified_ts created_ts name ... place.child-restrictions place.facilities.dogs-allowed place.facilities.free-wifi place.facilities.guide-dogs place.facilities.hearing-loop place.facilities.parking place.facilities.toilets place.facilities.toilets_baby-changing place.facilities.toilets_disabled place.facilities.wheelchair-access
0 5 York Place admin@thestand.co.uk EH1 3EB Stand Edinburgh http://www.thestand.co.uk 1 2021-11-24T12:18:33Z 2021-11-24T12:18:33Z The Stand ... True False True NaN NaN True True NaN False False
1 Ingliston NaN EH28 8NB Royal Highland Centre Edinburgh http://www.royalhighlandcentre.co.uk 376 2020-01-27T10:18:15Z 2020-01-27T10:18:15Z Royal Highland Centre ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 54 George Street enquiries@assemblyroomsedinburgh.co.uk EH2 2LR Assembly Rooms Edinburgh http://www.assemblyroomsedinburgh.co.uk 377 2020-10-12T16:00:45Z 2020-10-12T16:00:45Z Assembly Rooms ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

3 rows × 29 columns
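
The cell above expands the properties dictionaries with apply(pd.Series). An alternative sketch, shown on hypothetical toy rows, first replaces missing entries with empty dictionaries and then builds the property columns in one step. The `records`, `expanded` and `wide` names are illustrative assumptions:

```python
import pandas as pd

# Hypothetical rows: one place with properties, one without any.
toy = pd.DataFrame({
    "name": ["Place A", "Place B"],
    "properties": [
        {"place.facilities.parking": True, "place.capacity.max": "160"},
        None,
    ],
})

# Replace missing entries with empty dicts so every row expands cleanly.
records = [p if isinstance(p, dict) else {} for p in toy["properties"]]

# One column per property key, aligned to the original index.
expanded = pd.DataFrame(records, index=toy.index)
wide = pd.concat([toy.drop(columns="properties"), expanded], axis=1)
print(wide)
```

Places without a given property end up with a missing value in that column, which matches the NaN entries in the output above.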

4.1 Frequency of places grouped by wheelchair-access and town

In [19]:
df_properties_wc=df_properties.groupby(['place.facilities.wheelchair-access', 'town']).size().reset_index()
df_properties_wc=df_properties_wc.rename(columns={0: "number_of_times"})
df_properties_wc=df_properties_wc.sort_values(by=['number_of_times'], ascending=False)
df_properties_wc
Out[19]:
place.facilities.wheelchair-access town number_of_times
19 True Edinburgh 60
4 False Edinburgh 32
17 True Dunfermline 4
29 True Musselburgh 3
25 True Livingston 2
34 True St Andrews 2
30 True North Berwick 2
23 True Kirkcaldy 2
24 True Linlithgow 2
0 False Bathgate 1
21 True Glenrothes 1
22 True Haddington 1
28 True Melrose 1
26 True Livingston village 1
27 True Lochgelly 1
31 True Peebles 1
32 True Selkirk 1
33 True South Queensferry 1
20 True Falkland 1
18 True East Linton 1
1 False By Collessie 1
16 True Dirleton 1
15 True Cupar 1
14 True Cockenzie 1
13 True Bathgate 1
12 True Anstruther 1
11 True Aberlady 1
10 False Wilkieston 1
9 False St Andrews 1
8 False South Queensferry 1
7 False Newport-on-Tay 1
6 False Hawick 1
5 False Haddington 1
3 False Duns 1
2 False Dalkeith 1
35 True Tranent 1

4.2 Frequency of places grouped by toilets_disabled and town

In [20]:
df_properties_td=df_properties.groupby(['place.facilities.toilets_disabled', 'town']).size().reset_index()
df_properties_td=df_properties_td.rename(columns={0: "number_of_times"})
df_properties_td=df_properties_td.sort_values(by=['number_of_times'], ascending=False)
df_properties_td
Out[20]:
place.facilities.toilets_disabled town number_of_times
19 True Edinburgh 57
7 False Edinburgh 31
18 True Dunfermline 3
23 True Kirkcaldy 2
24 True Linlithgow 2
25 True Livingston 2
8 False Haddington 2
28 True Musselburgh 2
29 True North Berwick 2
32 True St Andrews 2
30 True Selkirk 1
27 True Melrose 1
26 True Lochgelly 1
31 True South Queensferry 1
33 True Tranent 1
22 True Hawick 1
21 True Glenrothes 1
20 True Falkland 1
0 False Aberlady 1
17 True Dalkeith 1
1 False Bathgate 1
16 True Cockenzie 1
15 True Bathgate 1
14 True Anstruther 1
13 False St Andrews 1
12 False South Queensferry 1
11 False Peebles 1
10 False Musselburgh 1
9 False Livingston village 1
6 False East Linton 1
5 False Duns 1
4 False Dunfermline 1
3 False Cupar 1
2 False By Collessie 1
34 True Wilkieston 1
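
The two tables above list True and False counts on separate rows, which makes towns hard to compare. For Edinburgh, for example, 60 of the 92 places with a recorded wheelchair-access value are accessible. The sketch below, on hypothetical data, puts both counts side by side with a cross-tabulation and adds the share per town. The labels accessible and not_accessible are illustrative choices:

```python
import pandas as pd

col = "place.facilities.wheelchair-access"

# Hypothetical places; None stands for a place with no recorded value.
toy = pd.DataFrame({
    "town": ["Town A", "Town A", "Town A", "Town B", "Town B"],
    col: [True, True, False, True, None],
})

# Keep only places where the property is recorded, then label the values.
known = toy.dropna(subset=[col])
labels = known[col].map({True: "accessible", False: "not_accessible"})

# Rows are towns, columns are the two labels, cells are counts.
table = pd.crosstab(known["town"], labels)
share = table["accessible"] / table.sum(axis=1)
print(table)
print(share)
```

The same approach would apply to place.facilities.toilets_disabled by changing `col`.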

Experiment 5: Exploring Descriptions

In [21]:
df_descriptions=df.explode('descriptions')
df_descriptions=pd.concat([df_descriptions.drop(['descriptions'], axis=1), df_descriptions['descriptions'].apply(pd.Series)], axis=1)
df_descriptions=df_descriptions.dropna(subset=['description']).reset_index()
documents=df_descriptions["description"].values
In [22]:
len(documents)
Out[22]:
222
In [23]:
import re 
from gensim.parsing.preprocessing import remove_stopwords
def clean_documents(text):
    text = re.sub(r'\S*@\S*\s?', '', text, flags=re.MULTILINE) # remove email
    text = re.sub(r'http\S+', '', text, flags=re.MULTILINE) # remove web addresses
    text = re.sub("\'", "", text) # remove single quotes
    text = remove_stopwords(text)
    return text
In [24]:
# Clean every description before embedding it
d=[clean_documents(text) for text in documents]
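
To show what the regular expressions in clean_documents do, here is a standard-library sketch that applies the same three substitutions to a made-up sentence. The stop-word removal step from gensim is left out on purpose, and the function name `strip_contacts` is a hypothetical stand-in:

```python
import re

def strip_contacts(text):
    """Apply the three regex steps from clean_documents (no stop-word removal)."""
    text = re.sub(r'\S*@\S*\s?', '', text)  # remove email-like tokens
    text = re.sub(r'http\S+', '', text)     # remove web addresses
    text = re.sub(r"'", "", text)           # remove ASCII single quotes
    return text

sample = "Email info@example.com or visit http://example.com for Bob's show. There’s more."
cleaned = strip_contacts(sample)
print(cleaned)
```

Note that the last pattern only targets the ASCII apostrophe, so typographic apostrophes such as the one in "There’s" pass through unchanged. Some of the descriptions shown later in this notebook contain that character.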

Generating Text Embeddings

In [25]:
model = SentenceTransformer('all-MiniLM-L6-v2')
# Encode the cleaned descriptions with the pretrained all-MiniLM-L6-v2 model (no training happens here)
text_embeddings = model.encode(d, batch_size = 8, show_progress_bar = True)

In [26]:
np.shape(text_embeddings)
Out[26]:
(222, 384)

Description Similarity

In [27]:
similarities = cosine_similarity(text_embeddings)
# argsort is ascending: the last index in each row is normally the document itself
# (similarity 1.0), so the second-to-last index is its nearest neighbour
similarities_sorted = similarities.argsort()
id_1 = []
id_2 = []
score = []
for index,array in enumerate(similarities_sorted):
    id_1.append(index)
    id_2.append(array[-2])
    score.append(similarities[index][array[-2]])
index_df = pd.DataFrame({'id_1' : id_1,
                          'id_2' : id_2,
                          'score' : score})
print(index_df)
     id_1  id_2     score
0       0    98  0.386197
1       1    80  0.550572
2       2    80  0.648187
3       3     4  0.642733
4       4     3  0.642733
..    ...   ...       ...
217   217   170  0.561509
218   218   214  0.644098
219   219   178  0.640817
220   220    38  0.439870
221   221   192  0.413957

[222 rows x 3 columns]
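
Taking the second-to-last argsort index relies on each document being its own top match. A possible alternative, sketched here with NumPy on hypothetical 2-D vectors instead of the real embeddings, masks the diagonal first and then takes the row-wise maximum:

```python
import numpy as np

# Hypothetical toy "embeddings"; rows 0 and 1 point in the same direction.
emb = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [1.0, 3.0]])

# Cosine similarity = dot product of L2-normalised rows.
unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
sim = unit @ unit.T

# Exclude self-matches, then pick the most similar other row.
np.fill_diagonal(sim, -np.inf)
nearest = sim.argmax(axis=1)
best_score = sim[np.arange(len(sim)), nearest]

print(nearest, best_score)
```

When two rows are each other's nearest neighbour, as rows 0 and 1 are here, the same score appears twice. This is why scores tend to show up in pairs in the sorted list.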
In [28]:
index_df["score"].sort_values(ascending=False)
Out[28]:
51     0.889471
52     0.889471
70     0.844464
71     0.844464
88     0.827258
         ...   
221    0.413957
39     0.402191
61     0.395569
0      0.386197
160    0.317641
Name: score, Length: 222, dtype: float32
In [36]:
index_df.iloc[51]
Out[36]:
id_1     51.000000
id_2     52.000000
score     0.889471
Name: 51, dtype: float64

NOTE: Documents 51 and 52 have the highest similarity score (0.889). Let's look at their text.

In [30]:
documents[51]
Out[30]:
"Those of a magpie disposition will love Shiela Dhariwal's array of silver, chunky, ethnic, beaded and stone-studded jewellery; an Aladdin's cave of internationally-sourced, sparkly stuff with a pretty broad price range, in a gorgeous hidden location. Prices range from £6.50-£345"
In [37]:
documents[52]
Out[37]:
"Five miles from Edinburgh city centre, Dalkeith Country Estate is home to some beautiful woodland, with bluebell walks, riverside trails, cycle tracks and picnic areas for families to enjoy. There’s also the excellent Fort Douglas Adventure Playground, with giant slides, tree top walkways, rope swings and its famous flying fox zip slide.\n\nOpened in July 2016, the brand new Dalkeith Country Park is an experience unlike any other. You can find the magical new Fort Douglas Adventure Playground alongside Restoration Yard, that holds The Kitchen Restautrant, the store and wellbeing lab wellbeing lab in the former stableyard which has been lovingly restored to create a truly special day out.\n\nThere's also the wider park to explore with waymarked walking and cycling trails to suit the whole family and special events too. Explore the Old Oak Wood with trees over 900 years old, enjoy a picnic in one of the areas we've created, or simply breathe in the fresh air of this beautiful country park, which you'll find hard to believe is just a few miles from Edinburgh's city centre."

Experiment 6: Topic Modelling

In [32]:
topic_model = BERTopic(min_topic_size=10).fit(d, text_embeddings)
topics, probs = topic_model.transform(d, text_embeddings)
topic_model.visualize_topics()
In [33]:
topic_model.visualize_barchart()
In [34]:
topic_model.visualize_heatmap()
In [35]:
topic_model.get_topic_freq()
Out[35]:
Topic Count
0 -1 81
1 0 45
2 1 37
3 2 36
4 3 23
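
In BERTopic, topic -1 collects the documents that were not assigned to any topic (outliers). A short calculation using the counts from the table above shows how large that group is relative to the 222 descriptions:

```python
# Topic counts copied from the get_topic_freq() output above.
counts = {-1: 81, 0: 45, 1: 37, 2: 36, 3: 23}

total = sum(counts.values())
outlier_share = counts[-1] / total

print(total, round(outlier_share, 3))
```

Roughly 36% of the descriptions are outliers, so the four named topics describe a bit under two thirds of the documents.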